Abstract

Cascaded regression has recently been applied to reconstructing 3D faces from single 2D images directly in shape space, and has achieved state-of-the-art performance. This paper thoroughly investigates such cascaded regression based 3D face reconstruction approaches from four perspectives that have not yet been well studied: (i) the impact of the number of 2D landmarks; (ii) the impact of the number of 3D vertices; (iii) the way of using standalone automated landmark detection methods; and (iv) the convergence property. To answer these questions, a simplified cascaded regression based 3D face reconstruction method is devised, which can be integrated with standalone automated landmark detection methods and reconstructs 3D face shapes that have the same pose and expression as the input face images, rather than normalized pose and expression. Moreover, an effective training method is proposed that disturbs the automatically detected landmarks. Comprehensive evaluation experiments have been conducted, with comparison to other 3D face reconstruction methods. The results not only deepen the understanding of cascaded regression based 3D face reconstruction approaches, but also demonstrate the effectiveness of the proposed method.

1. Introduction 

As a fundamental problem in computer vision, reconstructing three-dimensional (3D) face shapes from two-dimensional (2D) images has recently gained increasing attention because 3D faces provide features that are invariant to variations of pose, illumination, and expression. The reconstructed 3D faces are therefore useful for many real-world applications, for example, pose-robust face recognition [35], 3D facial expression analysis [24], and facial animation. Using 3D face shapes to recognize identities is believed to be more robust and more accurate than using only 2D face images. Despite this high recognition accuracy, fast acquisition of high-resolution and high-precision 3D face shapes is still difficult, especially under varying conditions or at a distance. On the other hand, 2D face images can be captured much more easily with widespread cameras, and there are already plenty of 2D face image databases. Efficient methods for reconstructing 3D faces from 2D face images are thus in high demand, so that the rich resources of 2D face images and facilities can be better utilized.

A novel method [22] has recently been proposed for reconstructing 3D face shapes from single 2D images via cascaded regression in 2D/3D shape space. It is based on the observation that the landmarks' locations on the 2D image can be derived from the reconstructed 3D shape, and that the displacement of the derived landmarks from their true positions is correlated with the accuracy of the reconstructed 3D shape. This method can simultaneously locate facial landmarks and reconstruct 3D face shapes with two sets of cascaded regressors, one for updating landmarks and the other for updating 3D face shapes. By effectively exploiting the correlation between 2D landmarks and 3D shapes, this method achieves state-of-the-art performance in both face alignment and 3D face reconstruction for face images of arbitrary view and expression. Some questions, however, remain open with regard to such shape space regression based 3D face reconstruction methods:

• Impact of the number of 2D landmarks. Different sets of 2D landmarks are used in the face alignment and recognition literature, e.g., 68 landmarks, 21 landmarks, and 5 landmarks. How will the 3D face reconstruction accuracy be affected if different numbers of 2D landmarks are used to guide the 3D face reconstruction process?

• Impact of the number of 3D vertices. 3D face shapes can be represented by different numbers of vertices, i.e., different 3D point cloud densities and coverage. Will a sparse or narrow 3D face shape be easier to reconstruct with high accuracy than a dense or wide 3D face shape*?

• What if standalone landmark localization methods are used? Although the method in [22] can simultaneously locate 2D landmarks and reconstruct 3D shapes, it requires that the training 2D face images be annotated with both visible and invisible landmarks. Manually marking invisible landmarks is, however, very difficult and error-prone. Is it possible to integrate standalone landmark localization methods with the 3D face reconstruction process proposed in [22]?

• Convergence. As an iterative approach, how many iterations are necessary for the proposed method to achieve acceptable performance in terms of both accuracy and efficiency? In other words, what is the convergence property of shape space regression based 3D face reconstruction methods?

* A wide 3D face shape covers more areas than a narrow 3D face shape. For instance, a 3D face shape covering only eyes, eyebrows, nose and mouth is narrow compared with a 3D face shape covering the area from the left ear to the right ear.

The goal of this paper is to investigate the shape space regression based 3D face reconstruction approach from the aforementioned four aspects. To this end, we first revise and implement the method in [22] so that the 3D face reconstruction process can take 2D landmarks provided by a third party as input, and reconstruct 3D face shapes that have the same pose and expression as the input images, rather than frontal pose and neutral expression. See Fig. 1 for the results of the method on some photos from the AFW database [36] using the ground truth visible 2D landmarks as input. We then experimentally evaluate the convergence and computational complexity of the implemented 3D face reconstruction method. Afterwards, we conduct extensive experiments to assess the impact of the number of 2D landmarks and the number of 3D vertices on reconstruction accuracy. We finally attempt to integrate state-of-the-art landmark localization methods into the 3D face reconstruction process.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 and Section 4 present in detail the shape space regression based 3D face reconstruction method and its implementation. Section 5 reports the experimental results. Section 6 finally concludes the paper.

2. Related Work 

In order to solve the intrinsically ill-posed single-view 3D face reconstruction problem, different priors or constraints have been introduced, resulting in the Shape from Shading (SFS) based methods and the 3D Morphable Model (3DMM) based methods. SFS based methods [13] recover 3D shapes by analyzing certain clues in the 2D texture images, under the assumption of Lambertian reflectance and a single point light source at infinity. While classical SFS based methods [20, 23, 31] were initially designed for generic 3D shape reconstruction, their performance in recovering 3D face shapes can be further improved by using some reference 3D face models as additional constraints. These methods usually have limited accuracy because (i) their assumed connection between 2D texture clues and 3D shape information is too weak to discriminate between different human faces, (ii) they do not fully exploit the prior knowledge of 3D faces and depend significantly on the reference models, and (iii) they reconstruct a depth map or 2.5D shape instead of a full 3D shape, since they tend to operate on faces with a narrow range of poses.

Figure 1. Reconstruction results of the proposed method on face images from the AFW database [36] with arbitrary expressions and poses.

Figure 2. Reconstruction results for images in the Basel Face Model (BFM) (top row) and BU3DFE (bottom row) databases. From left to right columns: input images, ground truth 3D shapes (GT), and results by 3DMM, E-3DMM [35], SFS and our proposed method.

The 3D Morphable Model (3DMM), as a typical statistical 3D face model, explicitly learns the prior knowledge of 3D faces with a statistical parametric model. It represents a 3D face as a linear combination of basis 3D faces, which are obtained by applying principal component analysis (PCA) to a set of densely aligned 3D faces. To recover the 3D face from a 2D image, the combination coefficients are estimated by minimizing the discrepancy between the input 2D face image and the one rendered from the reconstructed 3D face. These 3DMM based methods can better cope with 2D images of varying illuminations and poses. However, they are limited in reconstructing individual details because PCA is in essence a global model, and they involve a time-consuming online optimization process to search for the optimal solution in the parameter space. Moreover, 3DMM needs an additional linear expression model to handle facial expressions, namely E-3DMM. Finally, neither SFS-based nor 3DMM-based methods cope consistently well with rotated or expressive face images, because of the invisible or deformed facial landmarks on them.

Motivated by the success of cascaded regression in 2D facial landmark localization [29], the authors recently proposed in [22] a 2D/3D shape space regression based method for reconstructing 3D face shapes from single images of arbitrary views and expressions. The method alternately applies 2D landmark regressors and 3D shape regressors. The 2D landmark regressors estimate landmark locations by regressing over the texture features around landmarks, while the 3D shape regressors reconstruct 3D face shapes by regressing over the 2D landmarks. Unlike existing 3D face reconstruction methods, this method directly estimates 3D faces in the 3D shape space via cascaded regression, without relying on parameterized 3D face models or assumed illumination models. As a result, it achieves state-of-the-art performance in both accuracy and efficiency of 3D face reconstruction. Figure 2 shows example results of SFS-based, 3DMM-based, E-3DMM-based and shape-space-regression-based methods on rotated and expressive face images. In this paper, we thoroughly assess the effectiveness of such shape space regression based 3D face reconstruction methods from various perspectives.
3. Shape Space Regression based Approach

3.1. Overview

We denote a 3D face shape as $S \in \mathbb{R}^{3 \times n}$, which is represented by the 3D locations of $n$ vertices, and the subset of $S$ with columns corresponding to $l$ annotated landmarks (e.g., eye corners and nose tip) as $S_L$. The projections of these 3D landmarks on the 2D face image $I$ are represented by $U \in \mathbb{R}^{2 \times l}$. The relationship between the 2D facial landmarks $U$ and the corresponding 3D landmarks $S_L$ can be described as:

$$U = M S_L = M D_n(R \bar{S}_L + T), \qquad (1)$$

where $\bar{S}$ is a frontal 3D face with neutral expression, $\{R \in \mathbb{R}^{3 \times 3}, T \in \mathbb{R}^{3 \times l}\}$ and $D_n(\cdot)$ are, respectively, the rigid deformation (i.e., rotation and translation) caused by pose variations and the non-rigid deformation function caused by expression variations that are applied to $\bar{S}$, resulting in the observed 3D face $S$, and $M$ is the camera projection matrix. Here, we employ weak perspective projection for $M$, as is conventionally done in the literature.

Our purpose in this paper is to reconstruct $S$ (rather than $\bar{S}$) from the given "ground truth" visible landmarks $U^*$ (either manually marked or automatically detected by a standalone method) for the face image $I$. As discussed above, we achieve this by iteratively updating the initial estimate of $S$ with a series of regressors in the 3D face shape space. These regressors calculate the adjustment to the estimated 3D face shape according to the deviation between the ground truth landmarks and the landmarks rendered from the estimated 3D face shape. Figure 3 shows the flowchart of the proposed method.

Figure 3. Flowchart of the shape space cascaded regression based 3D face reconstruction method. Green and red points denote, respectively, visible and invisible landmarks. Note that the method in this paper does not require the invisible landmarks' locations as input.
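As an illustration of Eqn. (1), the sketch below projects rotated 3D landmarks onto the image plane. It is a minimal NumPy example under stated assumptions: `M` is taken to be a 2 x 3 scaled orthographic matrix, the expression deformation is set to the identity, and the helper names (`weak_perspective_matrix`, `yaw_rotation`, `project_landmarks`) are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def weak_perspective_matrix(scale):
    """Assumed form of M: drop the z coordinate, then scale uniformly."""
    return scale * np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0]])

def yaw_rotation(degrees):
    """Rotation matrix R for a rotation about the vertical (y) axis."""
    a = np.deg2rad(degrees)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def project_landmarks(M, R, S_bar_L, T=None):
    """U = M (R S_bar_L + T), with the expression deformation set to identity."""
    S_L = R @ S_bar_L
    if T is not None:
        S_L = S_L + T
    return M @ S_L

# Two toy landmarks (one per column) on a frontal, neutral face.
S_bar_L = np.array([[10.0, -10.0],
                    [0.0, 0.0],
                    [5.0, 5.0]])
M = weak_perspective_matrix(2.0)
U_frontal = project_landmarks(M, yaw_rotation(0.0), S_bar_L)
U_profile = project_landmarks(M, yaw_rotation(90.0), S_bar_L)
```

In the frontal view the two landmarks project to mirrored x positions, while at a 90 degree yaw they collapse onto the same image point, which is the self-occlusion situation the paper handles through landmark visibility.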


3.2. The Reconstruction Process 

Let $U^*$ be the "ground truth" landmarks (either manually annotated or automatically detected) on an input 2D image, and $S^{k-1}$ the currently reconstructed 3D shape after $k-1$ iterations. The corresponding landmarks $U^{k-1}$ can be obtained by projecting $S^{k-1}$ onto the image according to Eqn. (1). Then the updated 3D shape $S^k$ can be computed by

$$S^k = S^{k-1} + W^k (U^* - U^{k-1}) + b^k, \qquad (2)$$

where $W^k$ is the regressor in the $k$-th iteration and $b^k$ is a bias term (in the rest of this paper we omit the bias term for simplicity because it can be absorbed into the regressors).
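The update in Eqn. (2) can be sketched as a short loop. The following NumPy snippet is an illustrative reading rather than the authors' code: it assumes a 2 x 3 projection matrix `M`, an index array `lm_idx` selecting the landmark vertices, already-trained regressors of shape (3n, 2l), and the interleaved vector layout (x1, y1, z1, ...) and (u1, v1, ...), which corresponds to column-major flattening. The function name `reconstruct` is hypothetical, and the bias term is omitted as in the text.

```python
import numpy as np

def reconstruct(U_star, S0, regressors, M, lm_idx):
    """Apply the cascaded update S^k = S^{k-1} + W^k (U* - U^{k-1}).

    U_star: (2, l) input landmarks, S0: (3, n) initial shape,
    regressors: list of (3n, 2l) matrices, M: (2, 3), lm_idx: l vertex indices.
    """
    S = S0.copy()
    for W in regressors:
        U_prev = M @ S[:, lm_idx]                       # project current landmarks
        dU = (U_star - U_prev).reshape(-1, order="F")   # (u1, v1, u2, v2, ...)
        S = S + (W @ dU).reshape(3, -1, order="F")      # back to a 3 x n shape
    return S

# Toy check with n = l = 2 and a single all-ones regressor.
S0 = np.zeros((3, 2))
M = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
U_star = np.array([[1.0, 2.0], [3.0, 4.0]])
S1 = reconstruct(U_star, S0, [np.ones((6, 4))], M, [0, 1])
```

Each iteration costs one projection and one matrix-vector product, which is consistent with the real-time speed reported later in the paper.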

3.3. Learning Cascaded Regressors 

The $K$ regressors $\{W^k\}_{k=1}^{K}$ involved in the reconstruction process can be learned by optimizing the following objective function over the $N$ training samples:

$$\arg\min_{W^k} \sum_{i=1}^{N} \left\| (S_i^* - S_i^{k-1}) - W^k (U_i^* - U_i^{k-1}) \right\|_2^2, \qquad (3)$$

where $\{S_i^*, U_i^*\}$ is one training sample consisting of the ground truth landmarks $U_i^*$ on the $i$-th 2D face image and its corresponding ground truth 3D face shape $S_i^*$, which has the same pose and expression as the face image. Mathematically, the above optimization seeks a regressor that minimizes the overall error of the entire reconstructed 3D face shapes, not merely the error at the landmarks.

In this paper, we use linear regressors $W^k \in \mathbb{R}^{3n \times 2l}$. The optimization in Eqn. (3) can then be easily solved by least squares, with the solution

$$W^k = \Delta\mathbb{S}^k (\Delta\mathbb{U}^k)^T \left( \Delta\mathbb{U}^k (\Delta\mathbb{U}^k)^T \right)^{-1}, \qquad (4)$$

where $\Delta\mathbb{S}^k = \mathbb{S}^* - \mathbb{S}^{k-1}$ and $\Delta\mathbb{U}^k = \mathbb{U}^* - \mathbb{U}^{k-1}$ are the 3D shape adjustment and the 2D landmark deviation. $\mathbb{S} \in \mathbb{R}^{3n \times N}$ and $\mathbb{U} \in \mathbb{R}^{2l \times N}$ denote, respectively, the ensemble of 3D face shapes and 2D landmarks of all training samples, with each column corresponding to one sample. Note that, here, we write the 3D face shape and the 2D landmarks as column vectors: $S = (x_1, y_1, z_1, x_2, y_2, z_2, \cdots, x_n, y_n, z_n)^T$ and $U = (u_1, v_1, u_2, v_2, \cdots, u_l, v_l)^T$ ($T$ denotes the transpose operator). It can be shown mathematically that, to ensure a valid solution in Eqn. (4), $N$ should be larger than $2l$ so that $\Delta\mathbb{U}^k (\Delta\mathbb{U}^k)^T$ is invertible. Fortunately, since the set of used landmarks is usually sparse, this requirement can be easily satisfied in real-world applications.
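The closed-form solution in Eqn. (4) can be checked numerically. The sketch below is a minimal NumPy example with a hypothetical function name (`learn_regressor`); it solves the normal equations with `np.linalg.solve` instead of forming the explicit inverse, which is a common numerical choice and not something the paper specifies.

```python
import numpy as np

def learn_regressor(dS, dU):
    """Least-squares regressor W minimizing ||dS - W dU||_F^2.

    dS: (3n, N) shape adjustments, dU: (2l, N) landmark deviations.
    Requires dU dU^T to be invertible, as noted in the text.
    """
    G = dU @ dU.T                              # (2l, 2l) Gram matrix
    # W G = dS dU^T  <=>  G W^T = dU dS^T, since G is symmetric.
    return np.linalg.solve(G, dU @ dS.T).T

# Synthetic check: recover a known linear map from noise-free samples.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(6, 4))               # n = 2 vertices, l = 2 landmarks
dU = rng.normal(size=(4, 10))                  # N = 10 samples
dS = W_true @ dU
W_est = learn_regressor(dS, dU)
```

When the Gram matrix is singular or badly conditioned, `np.linalg.lstsq` on the transposed system gives a minimum-norm alternative.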


Algorithm 1 3D Cascaded Regressor Learning
Require: Training data $\{(I_i, S_i^*, U_i^*) \mid i = 1, 2, \cdots, N\}$, initial shape $S_i^0$ and camera projection matrix $M$.
Ensure: Cascaded regressors $\{W^k\}_{k=1}^{K}$.

1: for $k = 1, \ldots, K$ do
2:   Estimate the 2D projection $U_i^{k-1}$ from the current 3D face $S_i^{k-1}$ via Eqn. (1);
3:   Compute the 2D landmark adjustment and 3D face adjustment for all samples: $\Delta\mathbb{U}^k = \mathbb{U}^* - \mathbb{U}^{k-1}$, $\Delta\mathbb{S}^k = \mathbb{S}^* - \mathbb{S}^{k-1}$;
4:   Estimate $W^k$ via Eqn. (4);
5:   Update the 3D face $S_i^k$ via Eqn. (2);
6: end for
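Algorithm 1 can be sketched in a few lines of NumPy. The code below is an illustrative reading, not the authors' Matlab implementation: the function and helper names are hypothetical, `M` is assumed to be 2 x 3, visibility handling is left out, and `np.linalg.lstsq` is used for step 4 as a numerically safer stand-in for the explicit inverse in Eqn. (4).

```python
import numpy as np

def stack_vec(X):
    """(N, d, m) array -> (d*m, N) matrix; each column is (x1, y1, ..., xm, ym, ...)."""
    return X.transpose(0, 2, 1).reshape(X.shape[0], -1).T

def unstack_vec(V, d):
    """Inverse of stack_vec: (d*m, N) matrix -> (N, d, m) array."""
    return V.T.reshape(V.shape[1], -1, d).transpose(0, 2, 1)

def train_cascade(S_star, U_star, S0, M, lm_idx, K=5):
    """S_star: (N, 3, n), U_star: (N, 2, l), S0: (3, n), M: (2, 3)."""
    N = S_star.shape[0]
    S = np.repeat(S0[None, :, :], N, axis=0)         # same initial shape for all
    regressors = []
    history = [float(np.sum((S_star - S) ** 2))]     # objective before training
    for _ in range(K):
        U_prev = M @ S[:, :, lm_idx]                 # step 2: project landmarks
        dU = stack_vec(U_star - U_prev)              # step 3: (2l, N)
        dS = stack_vec(S_star - S)                   #         (3n, N)
        X, *_ = np.linalg.lstsq(dU.T, dS.T, rcond=None)  # step 4
        W = X.T                                      # (3n, 2l)
        S = S + unstack_vec(W @ dU, 3)               # step 5: update all shapes
        regressors.append(W)
        history.append(float(np.sum((S_star - S) ** 2)))
    return regressors, history
```

Because `W = 0` is always a feasible solution of each least-squares stage, the recorded objective values in `history` cannot increase from one iteration to the next, up to rounding.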


4. Implementation Details 
4.1. Initialization 

The proposed iterative method has two terms to initialize: the initial 3D face shape $S^0$ and the camera projection matrix $M$. Given the set of training samples, we select from them all the frontal faces with neutral expression. The mean of these selected 3D face shapes is computed and used to initialize $S^0$. Similarly, the mean of their 2D landmarks is calculated and denoted as $U^0$. The camera projection matrix $M$ can then be estimated by solving the following least squares fitting problem:

$$M = \arg\min_{M} \left\| U^0 - M \times S_L^0 \right\|_2^2. \qquad (5)$$

The obtained projection matrix $M$ is used throughout the 3D face reconstruction process to render 2D landmarks from the reconstructed 3D face shapes.

Figure 4. Sixty-eight landmarks are used in this work. Left: landmarks annotated on a 3D face. Middle and right: corresponding landmarks annotated on its 2D images with yaw angles of 20° and 40°. Green and red points on the 2D images indicate, respectively, visible and invisible landmarks, and blue points mark the contour instead of semantic landmarks.
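The fitting problem in Eqn. (5) is an ordinary linear least-squares problem. The sketch below shows two hedged variants in NumPy: a general 2 x 3 fit, and a one-parameter fit that assumes `M` is a pure scaling of the x and y coordinates, in the spirit of the single free parameter mentioned in Section 4.3. Both function names are hypothetical, and the exact parameterization used by the authors is not given in the text.

```python
import numpy as np

def estimate_projection(U0, S0_L):
    """General fit: M = argmin ||U0 - M S0_L||^2 with M of shape (2, 3)."""
    # Solve S0_L^T M^T = U0^T in the least-squares sense.
    X, *_ = np.linalg.lstsq(S0_L.T, U0.T, rcond=None)
    return X.T

def estimate_scale(U0, S0_L):
    """One-parameter fit, assuming M = s * [[1, 0, 0], [0, 1, 0]]."""
    P = S0_L[:2]                               # x and y rows of the landmarks
    return float(np.sum(U0 * P) / np.sum(P * P))

# Synthetic check with a known scaling of 2.5.
rng = np.random.default_rng(0)
S0_L = rng.normal(size=(3, 10))
M_true = 2.5 * np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
U0 = M_true @ S0_L
M_est = estimate_projection(U0, S0_L)
s_est = estimate_scale(U0, S0_L)
```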

4.2. Landmarks 

Figure 4 depicts the sixty-eight facial landmarks ($l = 68$) considered in this paper. Some of the landmarks become invisible on the 2D face images due to self-occlusion when the face has large pose angles. These invisible landmarks are difficult to annotate precisely. Hence, we treat them as missing data, and fill their corresponding entries in $U$ with zero. This way, the invisible landmarks do not affect the reconstruction, and images of arbitrary pose angles can be handled in a unified framework.

To automatically detect the visible landmarks in the testing phase, we first employ a state-of-the-art face alignment approach to locate the 2D landmark positions, and then compute their visibility. Most conventional face alignment methods cannot detect invisible self-occluded landmarks (refer to the red points in Fig. 4). In order to determine the visibility of the 2D landmarks projected from the reconstructed 3D face shape, given the detected 2D landmarks $U$ on the face image and the 3D annotated landmarks $S_L^0$ from the initial 3D shape $S^0$, we coarsely estimate the camera projection matrix $M$ by Eqn. (5). Suppose the 3D surface normal at the landmarks in $S^0$ is $\vec{N}$. The initial visibility $v$ can then be measured by

$$v = \frac{1}{2} \left( 1 + \mathrm{sgn}\left( \vec{N} \cdot \left( \frac{M_1}{\|M_1\|} \times \frac{M_2}{\|M_2\|} \right) \right) \right), \qquad (6)$$

where $\mathrm{sgn}(\cdot)$ is the sign function, '$\cdot$' denotes the dot product and '$\times$' the cross product, and $M_1$ and $M_2$ are the left-most three elements of the first and second rows of the mapping matrix $M$. This basically rotates the surface normal and checks whether it points toward the camera or not. Finally, to remain consistent with the training setting, the entries in $U$ corresponding to invisible landmarks are filled with zero.
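The visibility test of Eqn. (6) and the zero-filling of invisible landmarks can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names; it assumes the normals are stored one per column and that a positive dot product with the viewing direction means the landmark faces the camera, a sign convention the text does not spell out.

```python
import numpy as np

def landmark_visibility(M, normals):
    """Visibility per landmark from Eqn. (6): 1 = visible, 0 = invisible.

    M: projection matrix with at least 3 columns, normals: (3, l).
    A normal exactly perpendicular to the viewing direction yields 0.5.
    """
    m1 = M[0, :3] / np.linalg.norm(M[0, :3])
    m2 = M[1, :3] / np.linalg.norm(M[1, :3])
    view_dir = np.cross(m1, m2)                # direction facing the camera
    return 0.5 * (1.0 + np.sign(normals.T @ view_dir))

def mask_invisible(U, v):
    """Fill the columns of U (2, l) that belong to invisible landmarks with zero."""
    return np.where(v > 0.5, U, 0.0)

# Toy example: one normal facing the camera, one facing away.
M = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
normals = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, -1.0]])
v = landmark_visibility(M, normals)
U_masked = mask_invisible(np.array([[3.0, 4.0], [5.0, 6.0]]), v)
```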

4.3. Alignment 

To simplify the camera projection model, we assume that both the 3D face shapes and the 2D landmarks are well aligned. More specifically, (i) point-to-point dense registration has been established for all the 3D face shapes (i.e., they have the same number of vertices, and vertices of the same index have the same semantic meaning); (ii) all the 3D face shapes are centered at the origin of the world coordinate system; and (iii) all the faces on the 2D images are also centered in the image coordinate system. With these aligned 3D and 2D face data, and as we separate face deformation from camera projection (see Eqn. (1)), the employed weak perspective camera projection matrix $M$ has only one free parameter, i.e., the scaling factor or focal length, which is estimated from the training data.

5. Experimental Results 
5.1. Training Data 

A set of 3D face shapes and corresponding 2D face images with annotated landmarks is needed to train the regressors in the proposed method. To make the trained regressors robust to pose and expression variations, samples in the training dataset should have good diversity in their poses and expressions. It is, however, difficult to find in the public domain such datasets of 3D face shapes and corresponding annotated 2D images with various expressions and poses. Therefore, we use the Basel Face Model (BFM) to construct synthetic 3D faces of 200 subjects (50% female), and use the expression model from FaceWarehouse to generate random expressions on each of the 3D faces. These expressive 3D faces are then projected onto 2D images with 55 views, combining 11 yaw rotations (0°, ±15°, ±30°, ±50°, ±70°, ±90°) and 5 pitch rotations (0°, ±15°, ±30°), resulting in a total of 11,000 3D faces and corresponding synthetic images. Each 3D face consists of 53,215 vertices (the original BFM model has 53,490 vertices, but we discard the vertices in the tongue region). The 2D image resolution is 875 x 656 pixels and the inter-eye distance is about 220 pixels. The 68 landmarks on each 2D face image are recorded during the projection process (note that the 3D faces are densely aligned and the indices of the landmarks in the 3D face shapes are known), and the invisible landmarks are marked as zero as mentioned above.
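The size of the synthetic training set follows directly from the pose grid. The short Python sketch below enumerates the 55 views and reproduces the sample count; the constant names are hypothetical, and it assumes one expressive 3D face per subject and view, which is how the stated total of 11,000 works out.

```python
from itertools import product

# Pose grid described in the text (degrees).
YAWS = [0, 15, -15, 30, -30, 50, -50, 70, -70, 90, -90]
PITCHES = [0, 15, -15, 30, -30]
NUM_SUBJECTS = 200

views = list(product(YAWS, PITCHES))           # every (yaw, pitch) combination
num_samples = NUM_SUBJECTS * len(views)        # one sample per subject and view
```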


5.2. Convergence and Computational Complexity 

In this section, we experimentally investigate the convergence of the training process of the proposed cascaded regressors. To this end, we record the value of the objective function defined in Eqn. (3) at each iteration during the training process. Figure 6 shows the objective function value for 10 iterations. It can be clearly seen that the objective function value decreases substantially in the first five iterations, and becomes stable after seven iterations. This demonstrates the good convergence of the proposed method. In the following experiments, we empirically set $K = 5$ as a trade-off between accuracy and efficiency.

According to our experiments on a PC with an i7-4710 CPU and 8 GB memory, the Matlab implementation of the proposed method runs at about 26 frames per second (FPS). This indicates that the proposed method can reconstruct 3D faces in real time.



Figure 6. The objective function values as iteration proceeds. 

5.3. Reconstruction Accuracy across Poses on BFM 

The BFM database provides 10 test face subjects, each of whom has nine face images of neutral expression and different poses, including one frontal pose and eight yaw poses (±15°, ±30°, ±50°, ±70°). Here, the metric used to evaluate the 3D face shape reconstruction accuracy is the Mean Absolute Error (MAE), defined as

$$\mathrm{MAE} = \frac{1}{N_T} \sum_{i=1}^{N_T} \left( \| S_i^* - \hat{S}_i \| / n \right), \qquad (7)$$

where $N_T$ is the total number of test samples, and $\| S_i^* - \hat{S}_i \|$ is the Euclidean distance between the ground truth shape $S_i^*$ and the reconstructed 3D shape $\hat{S}_i$ of the $i$-th test sample. We report the MAE in mm after Procrustes alignment.
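One way to compute the metric of Eqn. (7) is sketched below in NumPy. Two points are assumptions rather than facts from the paper: the alignment is implemented as a rigid Kabsch fit (rotation and translation, no scaling) so that errors stay in mm, and the per-sample error is read as the average Euclidean distance over the $n$ vertices. The function names are hypothetical.

```python
import numpy as np

def rigid_align(src, dst):
    """Rigidly align src (n, 3) to dst (n, 3) with the Kabsch algorithm."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (src - mu_s) @ R.T + mu_d

def mean_absolute_error(gt_shapes, rec_shapes):
    """Average over test samples of the mean per-vertex distance after alignment."""
    errors = []
    for gt, rec in zip(gt_shapes, rec_shapes):
        aligned = rigid_align(rec, gt)
        errors.append(np.linalg.norm(gt - aligned, axis=1).mean())
    return float(np.mean(errors))
```

A reconstruction that differs from the ground truth only by a rotation and a translation therefore scores an error of zero, while a uniformly scaled copy does not.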

In this experiment, we use the visible landmarks projected from the ground truth 3D face shapes as input. The proposed method is compared with several state-of-the-art 3DMM-based methods, including the approach proposed by Aldrian and Smith, the multi-feature 3DMM framework based on contours, textured edges, specular highlights and pixel intensity proposed by Romdhani et al. [25], Sparse SIFT Flow 3DMM (SSF-3DMM), and the edge-fitting based 3DMM approach proposed by Bas et al. [4].

Table 1 shows the MAE of the different methods on the BFM database with respect to different poses of the face images. As can be seen, the average MAE of the proposed method is clearly lower than that of the counterpart methods. Moreover, its accuracy is very stable across different poses. This demonstrates the effectiveness of the proposed method in handling face images of arbitrary poses. Figure 5 shows the reconstruction results of our method and SSF-3DMM on two subjects in the BFM database.

Figure 5. Reconstruction results for two BFM samples at 9 different poses. First row: the input images. Second and fourth rows: the reconstructed 3D face shapes by SSF-3DMM and our proposed method. Third and fifth rows: their corresponding MAE error maps. The colormap goes from dark blue to dark red (corresponding to an error between 0 and 10). The numbers under each of the error maps represent mean and standard deviation values (mm).

Table 1. MAE (mm) of the proposed method and four state-of-the-art methods at different poses with ground truth landmarks.

5.4. Impact of the Number of 2D Landmarks 

In order to assess how the reconstruction accuracy changes as fewer landmarks are used, we divide the face into four regions, i.e., nose, eyes, mouth and other (see Fig. 7), and use different numbers of landmarks in these regions. Note that the number of vertices in the output reconstructed 3D face shape remains unchanged. Figure 7 shows the results, from which one can observe that while using more landmarks boosts the reconstruction accuracy for all regions, the gains of different regions are not uniform. A possible explanation is the varying complexity of different regions and the different significance of different landmarks. A better evaluation of the impact of 2D landmarks requires more extensive experiments, which are part of our future work. In the remaining experiments, we use the set of 68 landmarks (unless specified otherwise).

5.5. Impact of the Number of 3D Vertices 

In this experiment, we study the reconstruction precision of 3D face shapes with different numbers of vertices. Facial components including eyes, nose, mouth and eyebrows are the most discriminative parts for face recognition, and it is thus desirable to obtain more accurate facial component shapes. With this in mind, we assess the reconstruction accuracy as fewer non-facial-component vertices are used (i.e., the coverage of the 3D point cloud becomes more focused on facial components) while the number of input 2D landmarks remains unchanged (i.e., the 51 landmarks located on nose, eyes and mouth are used). Two MAEs are computed, based on the whole set of 3D vertices and on the subset of facial component vertices, respectively. As can be seen from the results in Fig. 8, the MAE over the whole set increases (by more than 0.5mm) as more non-facial-component vertices are required to be reconstructed. This is because the used landmarks do not provide sufficient constraints on non-facial-component vertices. In contrast, the MAE over the facial component vertex subset is not affected by the vertices outside the facial component area. From Eqn. (4), we can see that every vertex in the reconstructed 3D face shape is fully determined by the input landmarks, and that different vertices are independent of each other in their reconstruction errors. This explains the two curves in Fig. 8.

Figure 7. MAE of the proposed method in nose, eyes, mouth and the other regions on the BFM test samples when different 2D landmarks are used. The bottom row shows the vertex-wise MAE maps, in which errors increase from blue to red.

Figure 8. MAEs of the proposed method over the whole set of 3D vertices (blue curve) and the subset of facial component vertices (red curve) on the BFM test samples as more vertices are included in the reconstructed 3D face shape and the used 51 landmarks remain unchanged. The vertex-wise MAE map shows the MAE per vertex in the 3D face (errors increase from blue to red).

Table 2. MAE with automatically detected landmarks on BFM database.
... 3.24 3.30 3.21
Proposed II, CFSS: 3.17 3.04 3.00 3.01 3.01 3.08 3.26



In addition, we fix the coverage of the 3D point cloud to the facial component region, and evaluate the reconstruction accuracy when different numbers of 3D vertices in that region are reconstructed (i.e., the point cloud density is changed by, for example, uniform downsampling). Figure 9 shows the results, which indicate that the overall reconstruction accuracy is reduced slightly (by less than 0.001mm) as the number of reconstructed 3D vertices decreases. This is again mainly because of the independence between different vertices mentioned before. On the other hand, solving the optimization problem in Eqn. (3) essentially balances the reconstruction errors both among all training samples and among all the vertices in the reconstructed 3D face shape. Thus, different sets of vertices will theoretically result in different "balances". Fortunately, as long as the 2D landmarks provide sufficient constraints on the reconstructed region of the 3D face, the point cloud density in the reconstructed 3D face region has little effect on the reconstruction accuracy (also recall the results in Fig. 8, where additional vertices outside the facial component region do not change the reconstruction accuracy inside that region when facial component landmarks are used to guide the reconstruction). This is a favorable property of the proposed method, which makes it possible to reconstruct 3D faces of a higher resolution at the same precision, with no extra cost other than computational complexity (due to a higher-dimensional regression output).
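The independence between vertices discussed above can be illustrated for a single regression stage. The NumPy sketch below is an illustration under simplifying assumptions (one stage, unregularized linear least squares, random synthetic data, hypothetical names): it fits a regressor for all output coordinates and another one for only a subset, and compares the shared rows.

```python
import numpy as np

def fit_regressor(dS, dU):
    """Least-squares W minimizing ||dS - W dU||_F^2, with dS (3n, N) and dU (2l, N)."""
    X, *_ = np.linalg.lstsq(dU.T, dS.T, rcond=None)
    return X.T

rng = np.random.default_rng(0)
dU = rng.normal(size=(4, 30))                  # l = 2 landmarks, N = 30 samples
dS = rng.normal(size=(9, 30))                  # n = 3 vertices

W_full = fit_regressor(dS, dU)                 # all three vertices
keep = [0, 1, 2, 6, 7, 8]                      # coordinates of vertices 1 and 3
W_sub = fit_regressor(dS[keep], dU)            # vertex 2 removed from the output
same_rows = bool(np.allclose(W_full[keep], W_sub))
```

In this simplified setting the rows agree, so removing or adding output vertices does not change how the remaining ones are predicted from the same landmark deviations. The small differences reported in the experiment may arise from aspects not modelled here, such as the full cascade and the evaluation protocol.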



Figure 9. MAE of the proposed method over a fixed region of 3D 
face when different numbers of vertices are used to represent that 
region. 


5.6. Using Standalone Landmark Localization Methods

In the above evaluation experiments, the 2D visible landmarks are obtained from the ground truth 3D shapes. In this experiment, we use as the "ground truth" landmarks those that are automatically detected by several different methods, including SDM [29], DLIB, TCDCN [32], and CFSS. Considering the potential errors in automatically detected landmarks, we disturb the ground truth landmarks of the training data by zero-mean Gaussian noise with a standard deviation of 25 to improve the robustness of the obtained regressors. We conduct two series of experiments: (i) training using data with ground truth landmarks (denoted as Proposed I), and (ii) training using data with disturbed landmarks (denoted as Proposed II). In this experiment, the approaches of Romdhani et al. [25], E-3DMM [35] and Bas et al. [4] are selected as the baselines. We use the authors' own implementations with automatically detected landmarks. In this more challenging scenario, as shown in Table 2, our method trained with disturbed landmarks gives the best overall performance and is superior for all pose angles, especially with the DLIB face alignment method. Compared with the results obtained by using the landmarks generated from ground truth 3D face shapes (see Table 1), the accuracy obtained with automatically detected landmarks is worse (the MAE increases from 2.30mm to 3.34mm), but it can be improved by disturbing the landmarks during training (3.05mm).
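The landmark disturbance used for training Proposed II can be sketched as follows. This NumPy snippet is a hedged illustration: the function name is hypothetical, the unit of the standard deviation of 25 is not stated in the text (pixels are assumed here), and keeping invisible landmarks at zero is an assumption made to stay consistent with the zero-filling convention of Section 4.2.

```python
import numpy as np

def disturb_landmarks(U, visible, sigma=25.0, rng=None):
    """Add zero-mean Gaussian noise to visible landmarks; invisible ones stay zero.

    U: (2, l) landmark coordinates, visible: (l,) boolean mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=U.shape)
    return np.where(visible, U + noise, 0.0)

# Example: disturb 68 landmarks, the last 10 of which are invisible.
rng = np.random.default_rng(0)
U = rng.uniform(100.0, 500.0, size=(2, 68))
visible = np.arange(68) < 58
U[:, ~visible] = 0.0
U_noisy = disturb_landmarks(U, visible, sigma=25.0, rng=rng)
```

Drawing fresh noise for each training sample exposes the regressors to the kind of localization error a standalone detector may produce at test time.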



Figure 10. Average Normalized Per-vertex Depth Errors (NPDE) of the proposed and two counterpart methods for different expressions in the BU3DFE database.


5.7. Reconstruction Accuracy across Expressions on BU3DFE

The BU3DFE database [30] contains 3D faces of 100 subjects displaying seven expressions: neutral (NE), happiness (HA), disgust (DI), fear (FE), anger (AN), surprise (SU) and sadness (SA). All non-neutral expressions were acquired at four levels of intensity. We selected the neutral expression and the first intensity level of the other six expressions as testing sets, resulting in 700 testing samples. The reconstruction error is measured by the Normalized Per-vertex Depth Error (NPDE), which is defined by the depth error at each vertex of the test sample as

$$\mathrm{NPDE}(x_j, y_j) = \left( | z_j^* - \hat{z}_j | \right) / \left( z_{max}^* - z_{min}^* \right), \qquad (8)$$

where $z_{max}^*$ and $z_{min}^*$ are the maximum and minimum depth values in the ground truth 3D face shape of the test sample, and $z_j^*$ and $\hat{z}_j$ are the ground truth and reconstructed depth values at the $j$-th vertex. Figure 10 shows the accuracy of the proposed method as well as two counterpart methods for different expressions in the BU3DFE database. It can be seen that the proposed method achieves the lowest error for all the expressions. It reduces the overall average reconstruction error from 4.89% for SFS and 3.10% for E-3DMM [35] to 2.03%. Figure 11 shows the reconstruction results of our method, SFS and E-3DMM [35] on one subject under seven expressions.
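Eqn. (8) translates directly into code. The sketch below is a minimal NumPy version with a hypothetical function name; it takes the depth values of one test sample as 1D arrays and returns the per-vertex error, which can be averaged to obtain the percentages reported above.

```python
import numpy as np

def npde(z_gt, z_rec):
    """Normalized per-vertex depth error for one test sample.

    z_gt, z_rec: (n,) ground truth and reconstructed depth values.
    The error is normalized by the depth range of the ground truth shape.
    """
    depth_range = z_gt.max() - z_gt.min()
    return np.abs(z_gt - z_rec) / depth_range

# Toy example with three vertices and a ground truth depth range of 10.
z_gt = np.array([0.0, 10.0, 5.0])
z_rec = np.array([1.0, 10.0, 3.0])
errors = npde(z_gt, z_rec)
average_error = float(errors.mean())
```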

6. Conclusions 

In this paper, we have thoroughly investigated, with comprehensive experiments, the cascaded regression based 3D face reconstruction approach recently proposed in [22]. Our experimental results show that (i) more landmarks are generally helpful for accurate 3D face reconstruction, but different facial components gain differently from the increased landmarks; (ii) the overall 3D face reconstruction accuracy degrades if more areas are covered by the reconstructed 3D faces while the used landmarks remain the same; (iii) the reconstruction accuracy for a specific face area is not affected by the 3D point cloud density in that area or by the 3D vertices outside that area, as long as the input landmarks are not changed; (iv) using standalone automated facial landmark detection methods together with the cascaded regression based 3D face reconstruction methods is feasible, and the reconstruction accuracy can be improved by disturbing the landmarks during training; (v) the cascaded regression based 3D face reconstruction methods have a good convergence property. In addition, the revised reconstruction method together with its training method provides a feasible alternative approach to 3D face reconstruction, for which the training data can be prepared more easily than in [22] because the invisible landmarks' locations do not need to be annotated. In the future, given the accuracy and efficiency of the cascaded regression based 3D face reconstruction approach, we plan to apply it to unconstrained face recognition in real-world scenarios.